home *** CD-ROM | disk | FTP | other *** search
- Path: pegasus.montclair.edu!harmon
- From: harmon@pegasus.montclair.edu (Derek Harmon)
- Newsgroups: comp.lang.c
- Subject: Re: unique id for a string
- Date: 18 Feb 1996 17:20:46 -0500
- Organization: Montclair State University
- Message-ID: <harmon.824680556@pegasus.montclair.edu>
- References: <1996Feb16.175601.114182@kuhub.cc.ukans.edu>
- NNTP-Posting-Host: pegasus.montclair.edu
- X-Newsreader: NN version 6.5.0 #68 (NOV)
-
- anh@kuhub.cc.ukans.edu writes:
- >This is not a quite a C problem. But since the implementation will be in
- >C anyway.
-
- Then I won't quite give an answer in C. :)
-
- >Given an alphabet, for example, E = {A-Z}, I need to assign an unique id
- >to each possible strings (finite, approx 32 chars long, or whatever, but
- >it is finite) formed by the letters from E. I would like the id to be as
- >small as possible. Is there a good formula?
-
- >HELLO = H * 26^4 + E * 26 ^3 + L * 26^2 + L * 26^1 + E * 26^0
-
- If the possible strings had to be English, a slight improvement (smaller
- id for more strings) could be attained by ordering the ASCII char values to
- alternate values for the corresponding character which are used less in the
- language (ie, vowels would be ASCII 1-5, S could be 6, and X could be 26).
-
- >But the id is way too big. I need something that fits within a 32-bits int
- >or a 64-bits long.
-
- >Another way is to simply assign a id to a word and store the word in a
- >binary search tree, and therefore, it is relatively fast to make sure
- >each word has an unique id. But I think it is still to slow. We are
- >talking about a possible 200,000+ words.
-
- How balanced the tree would be depends on either the order of input, or
- the tree's self-balancing of itself which makes input longer, but logarith-
- mically lowers the bound on search time. But we are talking about more than
- 200,000 words here, and that is where the problem occurs. If the max length
- of these strings is 32 characters, then we are talking about roughly 7.3 x
- 10^43 different strings. That will not fit into a 64-bit long. :)
-
- The formula you give above will give all those strings distinct id nums,
- if you have knowledge of the probability of occurence of symbols from that
- language, you can shift the probability of smaller id num size (there will
- still be large id nums) for a random sampling of those strings. However,
- if symbols have equal probabilities of selection, finnagling of the formula
- won't add any advantage. If it were possible, the world would be a path to
- your door for an infinite data compression method.
-
- -- Stone
- --
- # Derek Harmon (aka Stonelight) harmon@pegasus.montclair.edu
- # - Computer Science Undergrad, Montclair State University, NJ
- # - My views are my own, nobody else is this creative. 3;)>
- ... Recursive. adj., see Recursive.
-
-